White Wine Quality Exploration

by Yi (Michelle) Deng

========================================================

This report explores a white wine dataset containing quality evaluations and attributes for 4898 white wines.

Univariate Plots Section

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

The data frame has 4898 observations and 13 columns: a row index (X) and 12 variables.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## [1] 20
## [1] 5

Quality scores range from 3 to 9, and most wines score between 5 and 7. The quality distribution appears roughly normal, peaking at 6. Twenty wines received the lowest score of 3, and only 5 wines received the highest score of 9.
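The counts quoted above can be obtained with a few base R calls. The sketch below is only an illustration: the `quality` vector is a made-up toy sample (not the real data), and the variable names `n_worst`, `n_best` and `modal_score` are assumptions introduced here.

```r
# Hypothetical toy sample standing in for wine$quality (not the real data).
quality <- c(3, 5, 5, 6, 6, 6, 6, 7, 7, 9)

# Min, quartiles, median, mean and max, in the same layout as the report.
print(summary(quality))

# Number of wines at the lowest and at the highest score.
n_worst <- sum(quality == min(quality))
n_best  <- sum(quality == max(quality))

# Frequency of every score; the most frequent score is the peak.
quality_counts <- table(quality)
modal_score <- as.integer(names(quality_counts)[which.max(quality_counts)])
```

On the real data frame, the same expressions applied to `wine$quality` should give the counts of 20 and 5 reported above.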

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

There are a few wines with extremely high fixed.acidity. After omitting the top 0.1% of values, the distribution of fixed.acidity appears normal, with a peak around 6.75.
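Omitting the top 0.1% of values amounts to filtering on the 99.9th percentile, the same `subset(..., x <= quantile(x, 0.999))` pattern that appears in the model calls later in this report. A minimal sketch, assuming a toy data frame `toy` with one planted outlier (the real `wine` data is not used here):

```r
set.seed(1)

# Hypothetical sample: 999 ordinary values plus one extreme value,
# standing in for wine$fixed.acidity.
toy <- data.frame(
  fixed.acidity = c(rnorm(999, mean = 6.8, sd = 0.8), 14.2)
)

# 99.9th percentile of the column.
cutoff <- quantile(toy$fixed.acidity, 0.999)

# Keep only rows at or below the cutoff, which drops the extreme tail.
trimmed <- subset(toy, fixed.acidity <= cutoff)
```

The trimming is meant for plotting only; the summary statistics printed in this section are computed on the full data.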

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

There are a few wines with extremely high volatile.acidity. After omitting the top 1% of values, the distribution of volatile.acidity appears normal, with a peak around 0.25.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

There are a few wines with extremely high citric.acid. After omitting the top 1% of values, the distribution of citric.acid appears normal, with a peak around 0.3. There is a second, sharp peak at 0.49, with more than 200 wines, all of which are scored 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

I transformed the long-tailed data to better understand the distribution of residual.sugar. The transformed residual.sugar distribution appears bimodal, with one peak around 1 and another around 8.
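A log transform compresses a long right tail so that the bulk of the distribution becomes visible. The sketch below assumes a base-10 log and uses a handful of made-up values spanning the reported range of residual.sugar; `log.sugar` and `back` are names introduced only for this example.

```r
# Hypothetical values spanning the reported range (0.6 to 65.8).
residual.sugar <- c(0.6, 1.0, 1.7, 5.2, 8.0, 9.9, 65.8)

# Base-10 log: 1 maps to 0, 10 maps to 1, and the long tail is pulled in.
log.sugar <- log10(residual.sugar)

# Raising 10 to the transformed value recovers the original units,
# which is how a peak on the log axis is read back as a sugar value.
back <- 10^log.sugar

print(round(log.sugar, 2))
```

With this convention, a peak near 8 on the original scale sits near 0.9 on the log10 axis.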

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

I transformed the long-tailed data to better understand the distribution of chlorides. The transformed chlorides distribution appears normal, with a peak around 0.05.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

There are a few wines with extremely high free.sulfur.dioxide. After omitting the top 1% of values, the distribution of free.sulfur.dioxide appears normal, with a peak around 30.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

After omitting the top 0.1% of values, the distribution of total.sulfur.dioxide appears normal, with a peak around 125.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

After omitting the top 0.1% of values, the distribution of density appears normal, with a broad peak between 0.992 and 0.995.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

The distribution of pH appears normal, with the peak around 3.15.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The distribution of sulphates appears normal, with the peak around 0.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wines in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol and quality).

The most common score is 6. Twenty wines received the lowest score of 3, and only 5 wines received the highest score of 9.

The median fixed.acidity is 6.80, ranging from 3.80 to 14.20.
The median volatile.acidity is 0.26, ranging from 0.08 to 1.10.
The median citric.acid is 0.32, ranging from 0 to 1.66.
The median residual.sugar is 5.20, ranging from 0.60 to 65.80.
The median chlorides is 0.043, ranging from 0.009 to 0.346.
The median free.sulfur.dioxide is 34.00, ranging from 2.00 to 289.00.
The median total.sulfur.dioxide is 134.00, ranging from 9.00 to 440.00.
The median density is 0.994, ranging from 0.987 to 1.039.
The median pH is 3.18, ranging from 2.72 to 3.82.
The median sulphates is 0.47, ranging from 0.22 to 1.08.
The median alcohol is 10.40, ranging from 8.00 to 14.20.

What is/are the main feature(s) of interest in your dataset?

The main feature in the dataset is quality. The quality rating is the evaluation outcome for each white wine. I’d like to determine which features are best for predicting the quality of a white wine. Since all other features are continuous variables, it is hard to say at this point which one is the better candidate.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

All 11 other features are likely to contribute to the quality of a white wine. They will be examined further in the bivariate and multivariate analyses that follow.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed the right skewed residual.sugar and chlorides distributions.

The transformed distribution of residual.sugar appears bimodal, with residual.sugar peaking around 1 and again around 8. There are few white wines with log10(residual.sugar) around 0.5.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000

Quality tends to correlate positively with alcohol and negatively with density. The total.sulfur.dioxide, residual.sugar, density and alcohol features tend to correlate with each other: higher alcohol is associated with lower density, lower residual.sugar and lower total.sulfur.dioxide. The total.sulfur.dioxide also tends to correlate positively with free.sulfur.dioxide. The pH tends to correlate negatively with fixed.acidity.
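A correlation matrix like the one printed above comes from calling `cor()` on the numeric columns. The following sketch uses a small, made-up three-column data frame (`toy`) rather than the real data, and the name `quality_cor` is an assumption introduced for the example.

```r
# Hypothetical toy rows standing in for three wine columns (not the real data).
toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9920, 0.9900),
  quality = c(5, 6, 6, 6, 7, 7)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(toy)

# Correlations of every other feature with quality, sorted ascending,
# which makes the strongest negative and positive candidates easy to spot.
quality_cor <- sort(cor_matrix["quality", colnames(cor_matrix) != "quality"])

print(round(cor_matrix, 3))
```

Reading a single row of the matrix this way is a quick check on which features are worth a closer look.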

From a subset of the data, only alcohol and density seem to correlate moderately with quality. However, since total.sulfur.dioxide, residual.sugar, density and alcohol tend to correlate with each other, I would like to take a closer look at scatter plots of these inter-correlated features.

It is hard to see the relationship between alcohol and quality from the scatter plot, so I converted quality to an ordered factor. The relationship appears to be nonlinear, with a drop at score 5.
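Treating an integer score as an ordered factor lets each score level be summarised or box-plotted on its own. The sketch below is illustrative only: the rows in `toy` are invented to mimic the kind of dip described above, and `quality.factor` is a hypothetical column name.

```r
# Hypothetical toy rows (not the real data), invented to show a dip at score 5.
toy <- data.frame(
  alcohol = c(10.2, 9.4, 9.6, 10.5, 10.8, 11.5, 11.9, 12.6),
  quality = c(4, 5, 5, 6, 6, 7, 7, 8)
)

# Ordered factor: levels keep their numeric order (4 < 5 < 6 < 7 < 8).
toy$quality.factor <- factor(toy$quality, ordered = TRUE)

# Median alcohol within each quality level.
median_by_quality <- tapply(toy$alcohol, toy$quality.factor, median)

print(median_by_quality)
```

A per-level summary such as this one shows a non-monotonic pattern that a raw scatter plot of an integer outcome tends to hide.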

## 
## Call:
## lm(formula = quality ~ alcohol, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5317 -0.5286  0.0012  0.4996  3.1579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.582009   0.098008   26.34   <2e-16 ***
## alcohol     0.313469   0.009258   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16

Despite the fact that the relationship looks nonlinear, based on the R^2 value, alcohol explains about 19 percent of the variance in quality.

## 
## Call:
## lm(formula = quality ~ density, data = wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1441 -0.6258  0.0005  0.5162  4.2102 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   96.277      4.003   24.05   <2e-16 ***
## density      -90.942      4.027  -22.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8429 on 4896 degrees of freedom
## Multiple R-squared:  0.09432,    Adjusted R-squared:  0.09414 
## F-statistic: 509.9 on 1 and 4896 DF,  p-value: < 2.2e-16

The relationship appears to be nonlinear. However, based on the R^2 value, density explains about 9 percent of the variance in quality.
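For a regression with a single predictor, R^2 equals the square of the Pearson correlation between the predictor and the outcome. The short check below applies that identity to the correlation coefficients printed in the matrix earlier in this section; the variable names are introduced here for illustration.

```r
# Correlations with quality, copied from the correlation matrix above.
r_alcohol <- 0.435574715
r_density <- -0.307123313

# In simple linear regression, R^2 is the squared correlation.
r2_alcohol <- r_alcohol^2
r2_density <- r_density^2

print(round(r2_alcohol, 4))  # matches Multiple R-squared for quality ~ alcohol
print(round(r2_density, 5))  # matches Multiple R-squared for quality ~ density
```

The squared values agree with the 0.1897 and 0.09432 reported by the two `lm` summaries, so the two single-predictor models add no information beyond the correlations themselves.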

## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.sulfur.dioxide and wine$residual.sugar
## t = 30.669, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3776791 0.4246712
## sample estimates:
##       cor 
## 0.4014393

## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.sulfur.dioxide and wine$free.sulfur.dioxide
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5977994 0.6326026
## sample estimates:
##      cor 
## 0.615501

## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.sulfur.dioxide and wine$density
## t = 43.719, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5094349 0.5497297
## sample estimates:
##       cor 
## 0.5298813

## 
##  Pearson's product-moment correlation
## 
## data:  wine$total.sulfur.dioxide and wine$alcohol
## t = -35.15, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4709775 -0.4262443
## sample estimates:
##        cor 
## -0.4488921

## 
##  Pearson's product-moment correlation
## 
## data:  wine$residual.sugar and wine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

## 
##  Pearson's product-moment correlation
## 
## data:  wine$residual.sugar and wine$alcohol
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4726723 -0.4280267
## sample estimates:
##        cor 
## -0.4506312

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$alcohol
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

## 
##  Pearson's product-moment correlation
## 
## data:  wine$pH and wine$fixed.acidity
## t = -32.934, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4485154 -0.4026542
## sample estimates:
##        cor 
## -0.4258583

After excluding the top 0.1% values of each feature, these scatter plots indicate inter-correlations among chemical properties, which may together influence the quality of white wines.
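The Pearson test output above has the layout produced by base R's `cor.test()`. The sketch below runs it on two short, made-up vectors (not the real columns) and recomputes the t statistic by hand from r and the degrees of freedom; `t_manual`, `r_reported` and `t_reported` are names assumed for this example.

```r
# Hypothetical toy vectors standing in for two wine columns (not the real data).
residual.sugar <- c(1.6, 6.9, 8.5, 7.0, 20.7, 1.5, 4.2, 12.1)
density <- c(0.9940, 0.9951, 0.9956, 0.9949, 1.0010, 0.9938, 0.9945, 0.9978)

# Pearson correlation test: estimate, t statistic, df, p-value, 95% CI.
ct <- cor.test(residual.sugar, density, method = "pearson")

# The t statistic follows from r and df = n - 2.
r <- unname(ct$estimate)
df <- unname(ct$parameter)
t_manual <- r * sqrt(df / (1 - r^2))

# Same formula applied to the reported residual.sugar vs. density result
# (r = 0.8389665, df = 4896); this lands close to the printed t = 107.87.
r_reported <- 0.8389665
t_reported <- r_reported * sqrt(4896 / (1 - r_reported^2))

print(ct)
```

With 4896 degrees of freedom even weak correlations give very small p-values, so the size of r is more informative here than the p-value.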

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There is no strong direct correlation between quality and any other feature. Alcohol correlates positively with quality, with a moderate correlation coefficient (r = 0.436). Density correlates negatively with quality, with a correlation coefficient of -0.307.

Based on the R^2 value, alcohol explains about 19 percent of the variance in quality, while density explains about 9 percent of the variance. Other features of interest can be incorporated into the model to explain other variance in the quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The total.sulfur.dioxide, residual.sugar, density and alcohol features are inter-correlated. Higher alcohol is associated with lower density, lower residual.sugar and lower total.sulfur.dioxide.

The total.sulfur.dioxide also correlates positively with free.sulfur.dioxide (r = 0.616). This makes sense, because free sulfur dioxide is one component of total sulfur dioxide.

The pH correlates negatively with fixed.acidity (r = -0.426): the higher the fixed.acidity, the lower the pH. This makes sense, because more acidic solutions have a lower pH.

What was the strongest relationship you found?

The residual.sugar is strongly and positively correlated with density (r = 0.839). Density is strongly and negatively correlated with alcohol (r = -0.780).

Multivariate Plots Section

Quality levels cluster by alcohol and density. In general, higher quality scores appear at the top left, where alcohol is higher and density is lower.

When quality is added to the residual.sugar vs. density plot, I notice that at a fixed density, higher residual.sugar is associated with a higher quality score.

Quality levels cluster by alcohol and residual.sugar. In general, higher quality scores appear at the bottom right, where alcohol is higher and residual.sugar is lower.

Quality does not correlate with pH and fixed.acidity. Nothing particularly stands out.

A linear model using these variables may be useful for predicting the quality of a white wine.
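The idea of growing a model one predictor at a time can be sketched with base R's `lm()` and `update()`. Everything in the example is an assumption made for illustration: the data frame `toy` is simulated (its coefficients are invented, not estimates from the wine data), and only two nested models are fitted.

```r
set.seed(42)
n <- 200

# Simulated stand-in for the wine data; the true coefficients here are invented.
toy <- data.frame(
  alcohol = runif(n, min = 8, max = 14),
  volatile.acidity = runif(n, min = 0.1, max = 0.6)
)
toy$quality <- 2.5 + 0.3 * toy$alcohol - 2 * toy$volatile.acidity +
  rnorm(n, sd = 0.8)

# Start with one predictor, then add a second to the same formula.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + volatile.acidity)

# Side-by-side fit statistics for the nested models.
comparison <- data.frame(
  model = c("m1", "m2"),
  r.squared = c(summary(m1)$r.squared, summary(m2)$r.squared),
  adj.r.squared = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  AIC = c(AIC(m1), AIC(m2))
)

print(comparison)
```

Plain R^2 can never decrease when a predictor is added to a nested least-squares model, so adjusted R^2, AIC or BIC are the more useful columns for deciding whether an extra term earns its place.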

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = subset(wine, alcohol <= 
##     quantile(wine$alcohol, 0.999)))
## m2: lm(formula = quality ~ alcohol + density, data = subset(wine, 
##     alcohol <= quantile(wine$alcohol, 0.999)))
## m3: lm(formula = quality ~ alcohol + density + residual.sugar, data = subset(wine, 
##     alcohol <= quantile(wine$alcohol, 0.999)))
## m4: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity, 
##     data = subset(wine, alcohol <= quantile(wine$alcohol, 0.999)))
## m5: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     chlorides, data = subset(wine, alcohol <= quantile(wine$alcohol, 
##     0.999)))
## m6: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     chlorides + total.sulfur.dioxide, data = subset(wine, alcohol <= 
##     quantile(wine$alcohol, 0.999)))
## m7: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity, data = subset(wine, 
##     alcohol <= quantile(wine$alcohol, 0.999)))
## m8: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity + sulphates, 
##     data = subset(wine, alcohol <= quantile(wine$alcohol, 0.999)))
## m9: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity + sulphates + 
##     pH, data = subset(wine, alcohol <= quantile(wine$alcohol, 
##     0.999)))
## m10: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity + sulphates + 
##     pH + citric.acid, data = subset(wine, alcohol <= quantile(wine$alcohol, 
##     0.999)))
## m11: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     chlorides + total.sulfur.dioxide + fixed.acidity + sulphates + 
##     pH + free.sulfur.dioxide, data = subset(wine, alcohol <= 
##     quantile(wine$alcohol, 0.999)))
## 
## ==============================================================================================================================================================
##                            m1          m2          m3          m4          m5          m6          m7          m8          m9           m10          m11      
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)            2.582***  -22.510***   90.296***   74.262***   73.304***   81.387***   60.364***   86.100***   162.880***   163.355***   150.001***  
##                         (0.098)     (6.169)    (12.377)    (11.979)    (12.001)    (12.248)    (14.113)    (14.728)     (18.574)     (18.612)     (18.765)    
##   alcohol                0.313***    0.360***    0.246***    0.286***    0.282***    0.283***    0.305***    0.274***     0.183***     0.182***     0.194***  
##                         (0.009)     (0.015)     (0.018)     (0.018)     (0.018)     (0.018)     (0.019)     (0.020)      (0.024)      (0.024)      (0.024)    
##   density                           24.746***  -87.870***  -71.580***  -70.543***  -78.816***  -57.521***  -83.546***  -163.031***  -163.516***  -150.085***  
##                                     (6.083)    (12.320)    (11.925)    (11.951)    (12.211)    (14.126)    (14.756)     (18.844)     (18.884)     (19.034)    
##   residual.sugar                                 0.053***    0.052***    0.052***    0.053***    0.045***    0.056***     0.087***     0.087***     0.081***  
##                                                 (0.005)     (0.005)     (0.005)     (0.005)     (0.005)     (0.006)      (0.007)      (0.007)      (0.008)    
##   volatile.acidity                                          -2.062***   -2.047***   -2.080***   -2.096***   -2.049***    -1.969***    -1.960***    -1.870***  
##                                                             (0.109)     (0.110)     (0.110)     (0.110)     (0.110)      (0.110)      (0.112)      (0.112)    
##   chlorides                                                             -0.696      -0.773      -0.861      -0.790       -0.156       -0.180       -0.236     
##                                                                         (0.540)     (0.540)     (0.540)     (0.538)      (0.544)      (0.547)      (0.543)    
##   total.sulfur.dioxide                                                               0.001**     0.001**     0.001*       0.001*       0.001*      -0.000     
##                                                                                     (0.000)     (0.000)     (0.000)      (0.000)      (0.000)      (0.000)    
##   fixed.acidity                                                                                 -0.045**    -0.029        0.067**      0.066**      0.066**   
##                                                                                                 (0.015)     (0.015)      (0.021)      (0.021)      (0.021)    
##   sulphates                                                                                                  0.593***     0.637***     0.635***     0.631***  
##                                                                                                             (0.101)      (0.101)      (0.101)      (0.100)    
##   pH                                                                                                                      0.708***     0.711***     0.685***  
##                                                                                                                          (0.105)      (0.105)      (0.105)    
##   citric.acid                                                                                                                          0.039                  
##                                                                                                                                       (0.096)                 
##   free.sulfur.dioxide                                                                                                                               0.004***  
##                                                                                                                                                    (0.001)    
## --------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                  0.2        0.2         0.2         0.3         0.3         0.3         0.3         0.3          0.3          0.3          0.3    
##   adj. R-squared             0.2        0.2         0.2         0.3         0.3         0.3         0.3         0.3          0.3          0.3          0.3    
##   sigma                      0.8        0.8         0.8         0.8         0.8         0.8         0.8         0.8          0.8          0.8          0.8    
##   F                       1142.0      581.1       432.6       437.5       350.4       294.3       253.9       228.1        209.6        188.6        191.3    
##   p                          0.0        0.0         0.0         0.0         0.0         0.0         0.0         0.0          0.0          0.0          0.0    
##   Log-likelihood         -5838.0    -5829.7     -5775.4     -5602.6     -5601.8     -5596.6     -5592.1     -5574.8      -5552.2      -5552.1      -5542.4    
##   Deviance                3112.3     3101.8      3033.7      2827.0      2826.0      2820.0      2814.8      2795.0       2769.3       2769.2       2758.2    
##   AIC                    11682.0    11667.5     11560.9     11217.3     11217.6     11209.2     11202.2     11169.7      11126.4      11128.3      11108.8    
##   BIC                    11701.5    11693.5     11593.3     11256.3     11263.1     11261.2     11260.7     11234.6      11197.9      11206.2      11186.8    
##   N                       4896       4896        4896        4896        4896        4896        4896        4896         4896         4896         4896      
## ==============================================================================================================================================================
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = subset(wine, alcohol <= 
##     quantile(wine$alcohol, 0.999)))
## m2: lm(formula = quality ~ alcohol + density, data = subset(wine, 
##     alcohol <= quantile(wine$alcohol, 0.999)))
## m3: lm(formula = quality ~ alcohol + density + residual.sugar, data = subset(wine, 
##     alcohol <= quantile(wine$alcohol, 0.999)))
## m4: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity, 
##     data = subset(wine, alcohol <= quantile(wine$alcohol, 0.999)))
## m5: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     fixed.acidity, data = subset(wine, alcohol <= quantile(wine$alcohol, 
##     0.999)))
## m6: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     fixed.acidity + sulphates, data = subset(wine, alcohol <= 
##     quantile(wine$alcohol, 0.999)))
## m7: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     fixed.acidity + sulphates + pH, data = subset(wine, alcohol <= 
##     quantile(wine$alcohol, 0.999)))
## m8: lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     fixed.acidity + sulphates + pH + free.sulfur.dioxide, data = subset(wine, 
##     alcohol <= quantile(wine$alcohol, 0.999)))
## 
## ========================================================================================================================
##                           m1          m2          m3          m4          m5          m6          m7           m8       
## ------------------------------------------------------------------------------------------------------------------------
##   (Intercept)           2.582***  -22.510***   90.296***   74.262***   52.915***   81.496***   157.578***   154.204***  
##                        (0.098)     (6.169)    (12.377)    (11.979)    (13.742)    (14.452)     (18.137)     (18.106)    
##   alcohol               0.313***    0.360***    0.246***    0.286***    0.310***    0.277***     0.182***     0.193***  
##                        (0.009)     (0.015)     (0.018)     (0.018)     (0.019)     (0.020)      (0.024)      (0.024)    
##   density                          24.746***  -87.870***  -71.580***  -49.973***  -78.888***  -157.620***  -154.388***  
##                                    (6.083)    (12.320)    (11.925)    (13.736)    (14.463)     (18.382)     (18.350)    
##   residual.sugar                                0.053***    0.052***    0.045***    0.056***     0.087***     0.083***  
##                                                (0.005)     (0.005)     (0.005)     (0.006)      (0.007)      (0.007)    
##   volatile.acidity                                         -2.062***   -2.084***   -2.039***    -1.945***    -1.890***  
##                                                            (0.109)     (0.109)     (0.109)      (0.109)      (0.110)    
##   fixed.acidity                                                        -0.047**    -0.030*       0.065**      0.068***  
##                                                                        (0.015)     (0.015)      (0.020)      (0.020)    
##   sulphates                                                                         0.620***     0.660***     0.628***  
##                                                                                    (0.100)      (0.100)      (0.100)    
##   pH                                                                                             0.713***     0.695***  
##                                                                                                 (0.104)      (0.103)    
##   free.sulfur.dioxide                                                                                         0.003***  
##                                                                                                              (0.001)    
## ------------------------------------------------------------------------------------------------------------------------
##   R-squared                 0.2        0.2         0.2         0.3         0.3         0.3          0.3          0.3    
##   adj. R-squared            0.2        0.2         0.2         0.3         0.3         0.3          0.3          0.3    
##   sigma                     0.8        0.8         0.8         0.8         0.8         0.8          0.8          0.8    
##   F                      1142.0      581.1       432.6       437.5       352.6       302.5        268.5        239.1    
##   p                         0.0        0.0         0.0         0.0         0.0         0.0          0.0          0.0    
##   Log-likelihood        -5838.0    -5829.7     -5775.4     -5602.6     -5597.7     -5578.6      -5555.0      -5542.8    
##   Deviance               3112.3     3101.8      3033.7      2827.0      2821.2      2799.4       2772.5       2758.7    
##   AIC                   11682.0    11667.5     11560.9     11217.3     11209.3     11173.3      11128.0      11105.6    
##   BIC                   11701.5    11693.5     11593.3     11256.3     11254.8     11225.3      11186.5      11170.5    
##   N                      4896       4896        4896        4896        4896        4896         4896         4896      
## ========================================================================================================================
## 
## Call:
## lm(formula = quality ~ alcohol + density + residual.sugar + volatile.acidity + 
##     fixed.acidity + sulphates + pH + free.sulfur.dioxide, data = subset(wine, 
##     alcohol <= quantile(wine$alcohol, 0.999)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8240 -0.4942 -0.0403  0.4667  3.1206 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.542e+02  1.811e+01   8.517  < 2e-16 ***
## alcohol              1.928e-01  2.411e-02   7.998 1.57e-15 ***
## density             -1.544e+02  1.835e+01  -8.414  < 2e-16 ***
## residual.sugar       8.286e-02  7.289e-03  11.368  < 2e-16 ***
## volatile.acidity    -1.890e+00  1.096e-01 -17.242  < 2e-16 ***
## fixed.acidity        6.827e-02  2.044e-02   3.340 0.000843 ***
## sulphates            6.276e-01  1.000e-01   6.273 3.84e-10 ***
## pH                   6.948e-01  1.034e-01   6.720 2.02e-11 ***
## free.sulfur.dioxide  3.347e-03  6.767e-04   4.946 7.82e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7513 on 4887 degrees of freedom
## Multiple R-squared:  0.2813, Adjusted R-squared:  0.2801 
## F-statistic: 239.1 on 8 and 4887 DF,  p-value: < 2.2e-16

The eight predictors in this linear model account for about 28% of the variance in the quality of white wines (multiple R-squared 0.2813, adjusted R-squared 0.2801).
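The adjusted R-squared and the F-statistic in the summary output follow from the multiple R-squared, the number of observations, and the number of predictors. As a quick sanity check, the sketch below recomputes both from the rounded figures printed above. The variable names are my own, and small differences from the printed values are expected because the inputs are already rounded.

```r
# Values copied from the summary output above.
r2 <- 0.2813   # multiple R-squared
n  <- 4896     # observations: 4887 residual df + 8 predictors + intercept
p  <- 8        # number of predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic on p and n - p - 1 degrees of freedom.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(round(f_stat, 1))
```

Both results should land close to the printed adjusted R-squared and F-statistic, which shows that with almost 5000 rows the penalty for eight predictors is tiny.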

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

In general, higher quality scores are associated with higher alcohol, lower density and lower residual.sugar values. However, when density is held constant, the white wines with higher quality scores tend to be the ones that have higher residual.sugar values. Since no single chemical property has a very strong relationship with the quality scores, this suggested starting with a linear model that includes all of these variables and then checking which ones play a significant role in the model.
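The change in the direction of the residual.sugar relationship once density is held constant can be reproduced on simulated data. The sketch below is not the wine data: every coefficient in it is invented for illustration, chosen only so that sugar raises density while density lowers quality.

```r
set.seed(42)  # fixed seed so the simulation is repeatable
n <- 5000

# Synthetic data: all coefficients below are invented for illustration.
sugar   <- runif(n, min = 1, max = 20)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
# Sugar raises density, alcohol lowers it.
density <- 0.994 + 0.0004 * sugar - 0.001 * (alcohol - 10.5) +
  rnorm(n, sd = 0.0005)
# Quality falls with density but rises with sugar at a fixed density.
quality <- 6 - 150 * (density - 0.994) + 0.04 * sugar + rnorm(n, sd = 0.7)

sim <- data.frame(quality, density, sugar)

# Sugar vs. quality, ignoring density.
marginal_cor <- cor(sim$sugar, sim$quality)

# Sugar effect with density held fixed.
partial_fit <- lm(quality ~ density + sugar, data = sim)
sugar_coef  <- coef(partial_fit)[["sugar"]]

print(marginal_cor)
print(sugar_coef)
```

In this setup the marginal correlation is negative while the regression coefficient on sugar is positive, the same pattern as in the model table, where residual.sugar has a positive coefficient once density is included.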

Were there any interesting or surprising interactions between features?

The quality scores do not cluster by pH and fixed.acidity.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model, starting with quality as a function of alcohol alone. Alcohol accounts for 18.9% of the variance in the quality of white wines. After adding other chemical variables, the final model, which contains 8 chemical variables, accounts for 28.0% of the variance in the quality of white wines.
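Adding predictors one at a time and tracking R-squared can be sketched as follows. The helper `r2_path` is a hypothetical name of my own, and the data frame is simulated with invented coefficients, so the numbers it prints are not the wine results.

```r
# Fit a sequence of nested models, adding one predictor at a time,
# and return the R-squared of each fit.
r2_path <- function(data, outcome, predictors) {
  vapply(seq_along(predictors), function(k) {
    f <- reformulate(predictors[seq_len(k)], response = outcome)
    summary(lm(f, data = data))$r.squared
  }, numeric(1))
}

# Simulated stand-in for the wine data (coefficients are invented).
set.seed(1)
demo <- data.frame(
  alcohol          = rnorm(500, mean = 10.5, sd = 1.2),
  volatile.acidity = runif(500, min = 0.1, max = 0.6)
)
demo$quality <- 3 + 0.3 * demo$alcohol - 2 * demo$volatile.acidity +
  rnorm(500, sd = 0.8)

r2 <- r2_path(demo, "quality", c("alcohol", "volatile.acidity"))
names(r2) <- c("alcohol", "+ volatile.acidity")
print(round(r2, 3))
```

For nested least-squares models R-squared cannot decrease as predictors are added, so the size of each increase, rather than its sign, is what indicates whether a variable is worth keeping.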


Final Plots and Summary

Plot One

Description One

After omitting the top 0.1% of values, the distribution of density appears approximately normal, ranging from 0.987 to 1.002, with a peak around 0.992 to 0.995.
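Omitting the top 0.1% of values can be done with a quantile cutoff, in the same way as the `subset(wine, alcohol <= quantile(wine$alcohol, 0.999))` call shown in the model output. The helper below is a minimal sketch with a hypothetical name, run on simulated density values rather than the wine data.

```r
# Keep only the rows whose value in `column` is at or below the
# given quantile (0.999 drops roughly the top 0.1%).
trim_top <- function(data, column, prob = 0.999) {
  cutoff <- quantile(data[[column]], probs = prob, na.rm = TRUE)
  keep <- !is.na(data[[column]]) & data[[column]] <= cutoff
  data[keep, , drop = FALSE]
}

# Simulated values; the mean and sd are invented for illustration.
set.seed(7)
demo <- data.frame(density = rnorm(4898, mean = 0.994, sd = 0.003))

trimmed <- trim_top(demo, "density")
print(nrow(demo) - nrow(trimmed))   # number of rows dropped
print(range(trimmed$density))
```

Trimming on a quantile rather than a fixed threshold keeps the rule the same across variables that are measured on different scales.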

Plot Two

Description Two

White wines with the highest quality scores have the highest alcohol levels and the lowest density. The variance in alcohol is larger for the wines scored 6, 7 and 8. The variance in density is larger for the wines scored 5 and 6. The wines scored 3 and 4 show no difference in alcohol level or density.
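The per-score comparison described here can also be tabulated. The sketch below treats quality as an ordered factor and reports the median and variance of one variable within each score. The function name and the simulated data are my own assumptions, not part of the original analysis.

```r
# Median and variance of `column` within each quality score.
spread_by_quality <- function(data, column) {
  q <- factor(data$quality, ordered = TRUE)
  data.frame(
    quality  = levels(q),
    median   = as.numeric(tapply(data[[column]], q, median)),
    variance = as.numeric(tapply(data[[column]], q, var)),
    n        = as.numeric(table(q))
  )
}

# Simulated scores and alcohol levels (relationship is invented).
set.seed(3)
demo <- data.frame(quality = sample(3:9, 1000, replace = TRUE))
demo$alcohol <- 8 + 0.4 * demo$quality + rnorm(1000)

by_score <- spread_by_quality(demo, "alcohol")
print(by_score)
```

A table like this makes it easy to check statements about which scores have the widest spread without reading them off a boxplot.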

Plot Three

Description Three

When density is held constant, white wines with higher residual sugar levels are more likely to have higher quality scores. The plot indicates that a linear model that includes density and residual sugar as predictor variables might be built to predict the quality of white wine.


Reflection

The white wine data set contains almost 5000 white wines described by 12 variables. The quality evaluation score is the outcome variable, while the other 11 chemical property variables are treated as candidate predictor variables. In order to understand which chemical properties may influence the quality of white wines, I started by examining the distribution of each variable, and then explored the relationships between pairs of variables of interest. Eventually, I built a linear model using 8 of the 11 chemical variables. This model can account for 28.0% of the variance in the quality of white wines.

There was no clear, strong trend between quality and any single chemical variable. The highest correlation coefficient was 0.436, for alcohol. Therefore, it was hard to pick the feature(s) of interest at the beginning. I started with pairwise plots and noticed that some chemical properties were inter-correlated, such as density, alcohol, residual sugar and total.sulfur.dioxide. I struggled to understand the relationship between quality and these inter-correlated chemical properties. After transforming quality into an ordered factor, the relationships were clearer in the multivariate plots.

My final linear model is only able to account for 28.0% of the variance in the quality of white wine, so its predictive power is weak. Given that the quality score can be treated as an ordered factor, ordinal regression or other predictive models, such as machine learning methods, may be a better option for this data set.
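As a rough idea of what the ordinal regression option could look like, the sketch below fits a proportional-odds model with `polr()` from the MASS package. This was not part of the analysis above: the data are simulated, the three quality classes and all coefficients are invented, and the example only shows the mechanics of fitting an ordered outcome.

```r
library(MASS)  # provides polr(); assumed to be installed

set.seed(11)
n <- 1000
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)

# Latent score with logistic noise, cut into three ordered classes.
# The slope 0.8 and the cut points are invented for illustration.
latent  <- 0.8 * (alcohol - 10.5) + rlogis(n)
quality <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
               labels = c("low", "medium", "high"),
               ordered_result = TRUE)
sim <- data.frame(alcohol, quality)

# Proportional-odds logistic regression on the ordered outcome.
fit <- polr(quality ~ alcohol, data = sim, Hess = TRUE)

print(coef(fit))   # slope on the latent (log-odds) scale
print(fit$zeta)    # estimated thresholds between adjacent classes
```

Unlike the linear model, this treats the gaps between adjacent scores as unknown thresholds to be estimated instead of assuming that each one-point step in quality is the same size.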